Add Parquet variant shredding support#328

Closed
CurtHagenlocher wants to merge 1 commit into apache:main from CurtHagenlocher:VariantShredding

@CurtHagenlocher
Contributor

What's Changed

Implements the Parquet variant shredding spec end-to-end in a new Apache.Arrow.Operations.Shredding namespace, alongside minor changes to the base scalar and array types.

Operations.Shredding reader side:

  • ShreddedVariant / ShreddedObject / ShreddedArray ref-struct trio exposing typed columns and residual bytes side-by-side.
  • VariantArrayShreddingExtensions adds GetShreddedVariant(i) and GetLogicalVariantValue(i) on VariantArray.
  • ShredSchema.FromArrowType derives a shredding schema from an Arrow typed_value type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).

Operations.Shredding producer side:

  • VariantShredder decomposes a column of VariantValues against a ShredSchema into shared metadata + per-row ShredResults.
  • ShreddedVariantArrayBuilder assembles those into a shredded VariantArray with a typed_value Arrow tree matching the schema.

Apache.Arrow changes:

  • VariantExtensionDefinition accepts struct<metadata, value?, typed_value?> layouts in addition to the plain unshredded form.

  • VariantType gains IsShredded / HasValueColumn / HasTypedValueColumn / TypedValueField properties.

  • VariantArray.GetVariantValue and GetVariantReader throw on shredded columns with a pointer to the Operations.Shredding extensions.

  • The public VariantArray(IArrowArray) constructor now infers the VariantType (shredded or not) from the storage shape.

  • Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.

Apache.Arrow.Scalars changes:

  • VariantValueWriter.CopyValue(VariantReader source) transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.

  • VariantMetadataBuilder.CollectFieldNames(VariantReader source) is the two-pass companion that accumulates source field names into the target metadata builder.

Validation:

  • Conformance tests run against the Iceberg shredded-variant corpus in apache/parquet-testing (test/parquet-testing/shredded_variant/). test/shredded_variant_ipc/regen.py converts each case-NNN.parquet to an Arrow IPC file via pyarrow; 137 resulting .arrow files are checked in so CI needs no Python. All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.
  • Additional round-trip, reader-style, and builder tests were added.

Copilot AI left a comment

Pull request overview

Adds end-to-end Parquet “variant shredding” support to the Arrow .NET operations layer, plus supporting scalar helpers and a checked-in IPC conformance corpus so CI can validate shredded-variant behavior without requiring a Parquet reader or Python.

Changes:

  • Adds Apache.Arrow.Operations.Shredding types/helpers and options/enums for shredded-variant typed_value handling.
  • Adds VariantValueWriter.CopyValue(VariantReader) and VariantMetadataBuilder.CollectFieldNames(VariantReader) to support cross-dictionary transcoding workflows.
  • Adds test/shredded_variant_ipc IPC fixtures (and a regen script) for conformance testing.

Reviewed changes

Copilot reviewed 28 out of 165 changed files in this pull request and generated 4 comments.

File Description
src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj Adds Operations → Apache.Arrow project reference needed by shredding types.
src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs Internal helper to construct per-row ShreddedVariant slots from element-group structs.
src/Apache.Arrow.Operations/Shredding/ShredOptions.cs Public options for shredding schema inference.
src/Apache.Arrow.Operations/Shredding/ShredType.cs Enum describing typed_value expectations for shredded variant columns.
src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs Adds recursive field-name collection to support 2-pass metadata + value encoding.
src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs Adds CopyValue/CopyPrimitive to transcode from a VariantReader into a writer.
test/shredded_variant_ipc/regen.py Script to regenerate IPC fixtures from the parquet-testing shredded_variant corpus.
test/shredded_variant_ipc/case-001.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-002.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-004.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-005.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-006.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-007.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-008.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-009.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-010.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-011.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-012.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-013.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-014.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-015.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-016.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-017.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-018.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-019.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-020.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-021.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-022.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-023.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-024.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-025.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-026.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-027.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-028.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-029.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-030.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-031.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-032.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-033.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-034.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-035.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-036.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-037.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-038.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-039.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-040.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-041.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-042.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-043-INVALID.arrow IPC fixture for shredded-variant conformance (invalid case).
test/shredded_variant_ipc/case-044.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-045.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-046.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-047.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-048.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-049.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-050.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-051.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-052.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-053.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-054.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-055.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-056.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-057.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-058.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-059.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-060.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-061.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-062.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-063.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-064.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-065.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-066.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-067.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-068.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-069.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-070.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-071.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-072.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-073.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-074.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-075.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-076.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-077.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-078.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-079.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-080.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-081.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-082.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-083.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-084-INVALID.arrow IPC fixture for shredded-variant conformance (invalid case).
test/shredded_variant_ipc/case-085.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-086.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-087.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-088.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-089.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-090.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-091.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-092.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-093.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-094.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-095.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-096.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-097.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-098.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-099.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-100.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-101.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-102.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-103.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-104.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-105.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-106.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-107.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-108.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-109.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-110.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-111.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-112.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-113.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-114.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-115.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-116.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-117.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-118.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-119.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-120.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-121.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-122.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-123.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-124.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-125-INVALID.arrow IPC fixture for shredded-variant conformance (invalid case).
test/shredded_variant_ipc/case-126.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-127.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-128.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-129.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-130.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-131.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-132.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-133.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-134.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-135.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-136.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-137.arrow IPC fixture for shredded-variant conformance.
test/shredded_variant_ipc/case-138.arrow IPC fixture for shredded-variant conformance.

public double MinTypeConsistency { get; set; } = 0.8;

/// <summary>Default options.</summary>
public static readonly ShredOptions Default = new ShredOptions();
Copilot AI Apr 25, 2026

ShredOptions.Default is a mutable static instance (the type has settable properties). Any consumer that does ShredOptions.Default.MaxDepth = ... will mutate global state for the entire process, which is easy to do accidentally and hard to debug. Consider making Default return a new instance each time (e.g., => new ShredOptions()), making the options immutable (init-only), or exposing a CreateDefault() factory instead.

Suggested change
- public static readonly ShredOptions Default = new ShredOptions();
+ public static ShredOptions Default => new ShredOptions();

Comment on lines +62 to +65
src_path = os.path.join(src, pf)
if not os.path.exists(src_path):
    continue

Copilot AI Apr 25, 2026

regen.py silently skips a Parquet file when src_path doesn't exist (continue). This can lead to an incomplete regenerated IPC corpus without any signal (e.g., if the parquet-testing subtree isn't fully checked out or a filename in cases.json changes). Consider failing fast or at least emitting a warning/error to stderr when a listed Parquet file is missing, and optionally track a nonzero exit code if any were skipped.
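
The fail-fast behavior suggested above could look roughly like the following. This is a minimal sketch, not the actual regen.py: the helper names `resolve_sources` and `exit_code_for` are hypothetical, and only `src`, `pf`, and `src_path` come from the quoted snippet. It assumes the script iterates over a list of Parquet file names relative to a source directory.

```python
import os
import sys


def resolve_sources(src, parquet_files):
    """Split the listed Parquet files into (found paths, missing names).

    Hypothetical helper: `src` is the corpus directory and `parquet_files`
    is the list of file names the script intends to convert.
    """
    found, missing = [], []
    for pf in parquet_files:
        src_path = os.path.join(src, pf)
        if os.path.exists(src_path):
            found.append(src_path)
        else:
            # Report the gap on stderr instead of silently continuing.
            print(f"error: listed Parquet file not found: {src_path}",
                  file=sys.stderr)
            missing.append(pf)
    return found, missing


def exit_code_for(missing):
    """Return a nonzero exit code when any listed file was skipped."""
    return 1 if missing else 0
```

A caller might convert everything in `found` and then finish with `sys.exit(exit_code_for(missing))`, so an incomplete corpus is visible both in the log and in the process exit status.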

Comment on lines +503 to +507
case VariantPrimitiveType.TimestampTzNanos: WriteTimestampTzNanos(source.GetTimestampTzNanos()); return;
case VariantPrimitiveType.TimestampNtzNanos: WriteTimestampNtzNanos(source.GetTimestampNtzNanos()); return;
case VariantPrimitiveType.String: WriteString(source.GetString()); return;
case VariantPrimitiveType.Binary: WriteBinary(source.GetBinary().ToArray()); return;
case VariantPrimitiveType.Uuid: WriteUuid(source.GetUuid()); return;
Copilot AI Apr 25, 2026

CopyPrimitive allocates for binary values via source.GetBinary().ToArray() because WriteBinary only accepts byte[]. For large variants or bulk transcoding this can be a significant overhead. Consider adding a WriteBinary(ReadOnlySpan<byte>)/WriteBinary(ReadOnlyMemory<byte>) overload (or equivalent) and using it here to avoid the intermediate allocation.

Comment on lines +38 to +43
StructType elementGroupType = (StructType)elementGroup.Data.DataType;
int valueIdx = elementGroupType.GetFieldIndex("value");
int typedIdx = elementGroupType.GetFieldIndex("typed_value");

IArrowArray valueArr = valueIdx >= 0 ? elementGroup.Fields[valueIdx] : null;
IArrowArray typedArr = typedIdx >= 0 ? elementGroup.Fields[typedIdx] : null;
Copilot AI Apr 25, 2026

BuildSlot calls StructType.GetFieldIndex("value") / GetFieldIndex("typed_value") on every invocation. This method is a linear scan (StructType.cs even notes caching if on a hot path), and BuildSlot is used inside per-element loops in ShreddedArray/ShreddedObject. Consider caching these field indices once per element-group StructType (or passing them in) to avoid repeated linear lookups.

CurtHagenlocher marked this pull request as draft on April 25, 2026 03:13
@CurtHagenlocher
Contributor Author

I need to refactor VariantValueWriter before doing this.

2 participants